Lab2 Exploring Image Data

47491968 Yuankai Liu
47511663 Jianing Zhao
47564911 Yang Peng

1. Business Understanding

This dataset is comprised of images about handwritten numbers from 0 to 9. The number of images for each number that we used is about 600, and there are more than 6000 images in total. Each image has 28 * 28 pixels.

The purpose of the dataset we selected is that we all have our own writing habits, so identifying handwritten numbers is a really valuable work for academia or for business. In our lab, we will try to make predictions to recognize the numbers in those images. We think some third parties would be interested in this results because it can reduce some simple but boring work for employees.

In this lab, we should ensure the correct rate is high enough, otherwise, it will be meaningless work. So we need to find a proper number of features for our algorithm.

2. Data preparation

2.1 Read in image

In this part, the original dataset has 60000 images in total, and we randomly took out ten percent of images for our lab.

In [3]:
import matplotlib.pyplot as plt
import skimage.io as io
from skimage import data_dir

str1='digital/0/*.png'
coll0 = io.ImageCollection(str1)
In [4]:
import random
data=[]
y=[]
limit=10
for i in range(0,len(coll0)):
    if  random.randint(0, 100)<=limit:
        data.append(coll0[i])
        y.append(0)

coll1 = io.ImageCollection('digital/1/*.png')
for i in range(0,len(coll1)):
    if  random.randint(0, 100)<=limit:
        data.append(coll1[i])
        y.append(1)
   
coll2 = io.ImageCollection('digital/2/*.png')
for i in range(0,len(coll2)):
    if  random.randint(0, 100)<=limit:
        data.append(coll2[i])
        y.append(2)

coll3 = io.ImageCollection('digital/3/*.png')
for i in range(0,len(coll3)):
    if  random.randint(0, 100)<=limit:
        data.append(coll3[i])
        y.append(3)
   
coll4 = io.ImageCollection('digital/4/*.png')
for i in range(0,len(coll4)):
    if  random.randint(0, 100)<=limit:
        data.append(coll4[i])
        y.append(4)

coll5 = io.ImageCollection('digital/5/*.png')
for i in range(0,len(coll5)):
    if  random.randint(0, 100)<=limit:
        data.append(coll5[i])
        y.append(5)
   
coll6 = io.ImageCollection('digital/6/*.png')
for i in range(0,len(coll6)):
    if  random.randint(0, 100)<=limit:
        data.append(coll6[i])
        y.append(6)

coll7 = io.ImageCollection('digital/7/*.png')
for i in range(0,len(coll7)):
    if  random.randint(0, 100)<=limit:
        data.append(coll7[i])
        y.append(7)
   
coll8 = io.ImageCollection('digital/8/*.png')
for i in range(0,len(coll8)):
    if  random.randint(0, 100)<=limit:
        data.append(coll8[i])
        y.append(8)

coll9 = io.ImageCollection('digital/9/*.png')
for i in range(0,len(coll9)):
    if  random.randint(0, 100)<=limit:
        data.append(coll9[i])
        y.append(9)
In [5]:
print(len(data))
print(len(y))
6458
6458

2.2 Linearize the images

Linearize the images to create a table of 1-D image features. Original image size is 28 * 28, so 784 is the number of features in 1-D images.

In [6]:
import numpy as np
a=np.array(data)
c=[]
for i in range(len(data)):
    c.append(a[i].reshape(-1)) 
d=np.array(c)
In [77]:
from sklearn.cross_validation import train_test_split




X=d
y=np.array(y)
n_samples, n_features = X.shape
h,w=a[0].shape
n_classes=10
img=a
names=['0','1','2','3','4','5','6','7','8','9']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

print("n_samples: {}".format(n_samples))
print("n_features: {}".format(n_features))
print("n_classes: {}".format(n_classes))
print("Original Image Sizes {} by {}".format(h,w))
print (28*28) # the size of the images are the size of the feature vectors
n_samples: 6458
n_features: 784
n_classes: 10
Original Image Sizes 28 by 28
784

2.3 Visualize several images

Below are several images selected randomly.

In [8]:
import random
def plot_gallery(images, titles, h, w, n_row=3, n_col=6):
    """Helper function to plot a gallery of portraits"""
    a=0
    b=len(X_train)
    plt.figure(figsize=(1.7 * n_col, 2.3 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        seq=random.randint(a,b)
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[seq].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[seq], size=12)
        plt.xticks(())
        plt.yticks(())
plot_gallery(X_train, y_train, h, w)

3.Data Reduction

3.1 linear dimensionality reduction

Perform linear dimensionality reduction of the images using principal components analysis.

In [9]:
from sklearn.decomposition import PCA

n_components = 300
print ("Extracting the top %d eigennumbers from %d numbers" % (
    n_components, X_train.shape[0]))

pca = PCA(n_components=n_components)
%time pca.fit(X_train.copy())
eigenfaces = pca.components_.reshape((n_components, h, w))
Extracting the top 300 eigennumbers from 4843 numbers
Wall time: 498 ms
In [10]:
def plot_explained_variance(pca):
    import plotly
    from plotly.graph_objs import Scatter, Marker, Layout, XAxis, YAxis, Bar, Line
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
            ],
        "layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
    })
In [11]:
import plotly
plotly.offline.init_notebook_mode()

Visualize the explained variance of each component.

In [12]:
plot_explained_variance(pca)

In this picture, we can learn that at least 150 components can adequately represent the image data. In our lab, we choosed 300 components. And the results are as below.

In [13]:
def plot_gallery2(images, titles, h, w, n_row=3, n_col=6):
    """Helper function to plot a gallery of portraits"""
    plt.figure(figsize=(1.7 * n_col, 2.3 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())


eigenface_titles = ["eigennumber %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery2(eigenfaces, eigenface_titles, h, w)

Above are some "base" images, which means they are only made by the most top useful eigen numbers.

In [14]:
n_components = 300
print ("Extracting the top %d eigennumbers from %d numbers" % (
    n_components, X_train.shape[0]))

rpca = PCA(n_components=n_components,svd_solver='randomized')
%time rpca.fit(X_train.copy())
eigenfaces = rpca.components_.reshape((n_components, h, w))
Extracting the top 300 eigennumbers from 4843 numbers
Wall time: 523 ms
In [15]:
eigenface_titles = ["eigennumber %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery2(eigenfaces, eigenface_titles, h, w)

Above are also some "base" images,which means they are only made by random eigen numbers,and they are used to reconstruct images.

3.2 Non-linear dimensionality reduction

Perform non-linear dimensionality reduction of the image data.

In [16]:
%%time
from sklearn.decomposition import KernelPCA

n_components = 400
# print(np.sum(~np.isfinite(X)))
print ("Extracting the top %d eigennumbers from %d digital" % (n_components, X_train.shape[0]))

kpca = KernelPCA(n_components=n_components, kernel='rbf', 
                fit_inverse_transform=True, gamma=15, # very sensitive to the gamma parameter,
                remove_zero_eig=True)  
kpca.fit(X_train.copy())
Extracting the top 400 eigennumbers from 4843 digital
Wall time: 11.1 s
In [17]:
import warnings
warnings.simplefilter('ignore', DeprecationWarning)

from ipywidgets import widgets  # make this interactive!
# compare the different methods

def plt_reconstruct(idx_to_reconstruct):
    idx_to_reconstruct = np.round(idx_to_reconstruct)
    
    reconstructed_image = pca.inverse_transform(pca.transform(X_train[idx_to_reconstruct].reshape(1, -1)))
    reconstructed_image_rpca = rpca.inverse_transform(rpca.transform(X_train[idx_to_reconstruct].reshape(1, -1)))
    reconstructed_image_kpca = kpca.inverse_transform(kpca.transform(X_train[idx_to_reconstruct].reshape(1, -1)))
    
    
    plt.figure(figsize=(15,7))
    
    plt.subplot(1,4,1)
    plt.imshow(X_train[idx_to_reconstruct].reshape((h, w)), cmap=plt.cm.gray)
    plt.title(y_train[idx_to_reconstruct])
    plt.grid()
    
    plt.subplot(1,4,2)
    plt.imshow(reconstructed_image.reshape((h, w)), cmap=plt.cm.gray)
    plt.title('Full PCA')
    plt.grid()
    
    plt.subplot(1,4,3)
    plt.imshow(reconstructed_image_rpca.reshape((h, w)), cmap=plt.cm.gray)
    plt.title('Randomized PCA')
    plt.grid()
    
    plt.subplot(1,4,4)
    plt.imshow(reconstructed_image_kpca.reshape((h, w)), cmap=plt.cm.gray)
    plt.title('Kernel PCA')
    plt.grid()
    
widgets.interact(plt_reconstruct,idx_to_reconstruct=(0,n_samples-1,1),__manual=True)
Out[17]:
<function __main__.plt_reconstruct>

There are reconstructed images by different methods.

3.3 Comparation

In this lab, intuitively, we prefer Full PCA because it runs much faster than Kernel PCA, and the reconstructed image is pretty good.

Below is the quantitative comparison. Build a nearest neighbor classifier to see actual classification performance. Every image finds the 10 nearest distance images, computes the times of label for each image. Then use the highest occurrences label to compare with the label of the original image.

In [66]:
%time X_train_pca = pca.transform(X_train)
%time X_train_rpca =rpca.transform(X_train)
%time X_train_kpca = kpca.transform(X_train)

%time dist_matrix_pca = pairwise_distances(X_train_pca)
%time dist_matrix_rpca= pairwise_distances(X_train_rpca)
%time dist_matrix_kpca= pairwise_distances(X_train_kpca)
Wall time: 35.6 ms
Wall time: 35.1 ms
Wall time: 1.51 s
Wall time: 274 ms
Wall time: 264 ms
Wall time: 282 ms
In [67]:
idx=0
y_pre=[]
for j in range(len(X_train)):
    pre=[0,0,0,0,0,0,0,0,0,0]
    distance=copy.deepcopy(dist_matrix_pca[j,:])
    for i in range(10):
        distance[idx]=np.infty
        idx=np.argmin(distance)
        distance[idx]=100
        pre[y_train[idx]]=pre[y_train[idx]]+1
    y_pre.append(pre.index(max(pre)))
count=0
for i in range(len(X_train)):
    if y_train[i]==y_pre[i]:
        count=count+1
print(count/len(X_train))
0.9401197604790419
In [68]:
idx=0
y_pre=[]
for j in range(len(X_train)):
    pre=[0,0,0,0,0,0,0,0,0,0]
    distance=copy.deepcopy(dist_matrix_rpca[j,:])
    for i in range(10):
        distance[idx]=np.infty
        idx=np.argmin(distance)
        distance[idx]=100
        pre[y_train[idx]]=pre[y_train[idx]]+1
    y_pre.append(pre.index(max(pre)))
count=0
for i in range(len(X_train)):
    if y_train[i]==y_pre[i]:
        count=count+1
print(count/len(X_train))
0.940326244063597
In [76]:
idx=0
y_pre=[]
for j in range(len(X_train)):
    pre=[0,0,0,0,0,0,0,0,0,0]
    distance=copy.deepcopy(dist_matrix_kpca[j,:])
    for i in range(10):
        distance[idx]=np.infty
        idx=np.argmin(distance)
        distance[idx]=100
        pre[y_train[idx]]=pre[y_train[idx]]+1
    y_pre.append(pre.index(max(pre)))
count=0
for i in range(len(X_train)):
    if y_train[i]==y_pre[i]:
        count=count+1
print(count/len(X_train))
0.29506504232913483

The results of actual classification performance also support our preference.

3.4 Feature extraction

Perform feature extraction upon the images using DAISY feature extraction technique.

In [18]:
from skimage.io import imshow

idx_to_reconstruct = int(np.random.rand(1)*len(X))
img1  = X_train[idx_to_reconstruct].reshape((h,w))
imshow(img1)
plt.grid()
In [19]:
from skimage.feature import daisy

# lets first visualize what the daisy descripto looks like
features, img_desc = daisy(img1,step=40, radius=10, rings=3, histograms=5, orientations=8, visualize=True)
imshow(img_desc)
plt.grid()
In [20]:
features = daisy(img1,step=10, radius=10, rings=2, histograms=4, orientations=8, visualize=False)
print(features.shape)
print(features.shape[0]*features.shape[1]*features.shape[2])
(1, 1, 72)
72
In [21]:
def apply_daisy(row,shape):
    feat = daisy(row.reshape(shape),step=10, radius=10, rings=2, histograms=6, orientations=8, visualize=False)
    return feat.reshape((-1))

%time test_feature = apply_daisy(X_train[3],(h,w))
test_feature.shape
Wall time: 3.01 ms
Out[21]:
(104,)
In [22]:
%time daisy_features = np.apply_along_axis(apply_daisy, 1, X_train, (h,w))
print(daisy_features.shape)
Wall time: 14 s
(4843, 104)
In [73]:
from sklearn.metrics.pairwise import pairwise_distances
%time dist_matrix = pairwise_distances(daisy_features)
Wall time: 234 ms

Compute the pairwise distance between all the different image features, and then compute the nearest distance to find the closest imange for the original image.

In [24]:
import copy
# find closest image to current image
idx1 = 5
distances = copy.deepcopy(dist_matrix[idx1,:])
distances[idx1] = np.infty
idx2 = np.argmin(distances)

plt.figure(figsize=(7,10))
plt.subplot(1,2,1)
imshow(X_train[idx1].reshape((h,w)))
plt.title("Original Image")
plt.grid()

plt.subplot(1,2,2)
imshow(X_train[idx2].reshape((h,w)))
plt.title("Closest Image")
plt.grid()
In [25]:
from ipywidgets import fixed
# put it together inside a nice widget
def closest_image(dmat,idx1):
    distances = copy.deepcopy(dmat[idx1,:]) # get all image diatances
    distances[idx1] = np.infty # dont pick the same image!
    idx2 = np.argmin(distances)
    
    distances[idx2] = np.infty
    idx3 = np.argmin(distances)
    
    plt.figure(figsize=(10,16))
    plt.subplot(1,3,1)
    imshow(X_train[idx1].reshape((h,w)))
    plt.title("Original Image "+names[y_train[idx1]])
    plt.grid()

    plt.subplot(1,3,2)
    imshow(X_train[idx2].reshape((h,w)))
    plt.title("Closest Image  "+names[y_train[idx2]])
    plt.grid()
    
    plt.subplot(1,3,3)
    imshow(X_train[idx3].reshape((h,w)))
    plt.title("Next Closest Image "+names[y_train[idx3]])
    plt.grid()
    
widgets.interact(closest_image,idx1=(0,len(X_train)-1,1),dmat=fixed(dist_matrix),__manual=True)
Out[25]:
<function __main__.closest_image>

3.5 Nearest neighbor classifier

Build a nearest neighbor classifier to see actual classification performance.

Every image finds the 10 nearest distance images, computes the times of label for each image. Then use the highest occurrences label to compare with the label of the original image.

In [74]:
idx=0
y_pre=[]
for j in range(len(X_train)):
    pre=[0,0,0,0,0,0,0,0,0,0]
    distance=copy.deepcopy(dist_matrix[j,:])
    for i in range(10):
        distance[idx]=np.infty
        idx=np.argmin(distance)
        distance[idx]=100
        pre[y_train[idx]]=pre[y_train[idx]]+1
    y_pre.append(pre.index(max(pre)))
count=0
for i in range(len(X_train)):
    if y_train[i]==y_pre[i]:
        count=count+1
print(count/len(X_train))
0.9558125129052241

In this lab, the correct rate is 95.58%, so this classifier performs pretty well.

3.6 Exceptional Work - Gabor Kernels for Feature Extraction

In [28]:
from skimage.filters import gabor_kernel
from scipy import ndimage as ndi
from scipy import stats

# prepare filter bank kernels
kernels = []
for theta in range(4):
    theta = theta / 4. * np.pi
    for sigma in (1, 3):
        for frequency in (0.05, 0.25):
            kernel = np.real(gabor_kernel(frequency, theta=theta,
                                          sigma_x=sigma, sigma_y=sigma))
            kernels.append(kernel)

            
# compute the filter bank and take statistics of image
def compute_gabor(row, kernels, shape):
    feats = np.zeros((len(kernels), 4), dtype=np.double)
    for k, kernel in enumerate(kernels):
        filtered = ndi.convolve(row.reshape(shape), kernel, mode='wrap')
        _,_,feats[k,0],feats[k,1],feats[k,2],feats[k,3] = stats.describe(filtered.reshape(-1))
        # mean, var, skew, kurt
        
    return feats.reshape(-1)

idx_to_reconstruct = int(np.random.rand(1)*len(X_train))

gabr_feature = compute_gabor(X_train[idx_to_reconstruct], kernels, (h,w))
gabr_feature
Out[28]:
array([  3.18647959e+01,   2.91314389e+03,   1.55881207e+00,
         9.47878695e-01,   4.86352041e+01,   7.71765475e+03,
         1.66874123e+00,   1.08528114e+00,   5.14438776e+01,
         5.88237233e+03,   2.06908661e+00,   2.81089864e+00,
         8.65561224e+01,   1.35176239e+04,   6.82989340e-01,
        -1.52137811e+00,   3.18494898e+01,   2.90009737e+03,
         1.55334001e+00,   9.29827441e-01,   4.40573980e+01,
         7.05931343e+03,   1.87640662e+00,   1.86085273e+00,
         3.58852041e+01,   3.27976714e+03,   3.07386344e+00,
         9.10138717e+00,   6.66135204e+01,   1.20875503e+04,
         1.09224123e+00,  -8.01715017e-01,   3.18660714e+01,
         2.88328855e+03,   1.54420672e+00,   8.93276609e-01,
         3.81415816e+01,   6.05657380e+03,   2.16764203e+00,
         3.12801613e+00,   3.49974490e+01,   3.12193358e+03,
         3.18432849e+00,   9.84957645e+00,   5.68571429e+01,
         1.09246284e+04,   1.33755992e+00,  -2.06719325e-01,
         3.18647959e+01,   2.89691401e+03,   1.54948428e+00,
         9.10785510e-01,   4.17818878e+01,   6.67867395e+03,
         1.97616629e+00,   2.28008264e+00,   3.42117347e+01,
         2.96196788e+03,   3.25458517e+00,   1.05563059e+01,
         7.44579082e+01,   1.28808871e+04,   9.20305728e-01,
        -1.14773956e+00])
In [29]:
# takes ~3 minutes to run entire dataset
%time gabor_stats = np.apply_along_axis(compute_gabor, 1, X_train, kernels, (h,w))
print(gabor_stats.shape)
Wall time: 1min 1s
(4843, 64)
In [30]:
from sklearn.metrics.pairwise import pairwise_distances
# find the pairwise distance between all the different image features
%time dist_matrix_gabor = pairwise_distances(gabor_stats)
Wall time: 208 ms
In [31]:
widgets.interact(closest_image,idx1=(6),dmat=fixed(dist_matrix_gabor),__manual=True)
Out[31]:
<function __main__.closest_image>
In [33]:
print(dist_matrix_gabor.shape)
(4843, 4843)
In [34]:
y_pre=[]
for j in range(len(X_train)):
    pre=[0,0,0,0,0,0,0,0,0,0]
    distance=copy.deepcopy(dist_matrix_gabor[j,:])
    for i in range(10):
        distance[idx]=np.infty
        idx=np.argmin(distance)
        distance[idx]=100
        pre[y_train[idx]]=pre[y_train[idx]]+1
    y_pre.append(pre.index(max(pre)))
count=0
for i in range(len(X_train)):
    if y_train[i]==y_pre[i]:
        count=count+1
print(count/len(X_train))
0.726409250464588

In this part, we use Gabor Kernels for feature extraction, the result is 72.64%, which is worse than the result in 3.5. We think that it may due to the type of images. Different type of images have different distributions of pixels, pixels of our number images are characteristic, so we should grasp several methods and use them in different conditions.